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Abstract 

A many to one discrete auditory transform is presented to map 
a sound signal to a perceptually meaningful spectrum on the scale 
of human auditory filter band widths (critical bands). A generalized 
inverse is constructed in closed analytical form, preserving the band 
energy and band signal to noise ratio of the input sound signal. The 
forward and inverse transforms can be implemented in real time. Ex- 
periments on speech and music segments show that the inversion gives 
a perceptually equivalent though mathematically different sound from 
the input. 
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1 Introduction 



Short term discrete Fourier transform (DFT) is a common tool to map sound 
signals from time domain to spectral domain for analysis and synthesis ^J. 
However, the spectral resolution of DFT over a standard short time window of 
5 to 15 milleseconds (ms) in duration is much more refined than the resolution 
of human auditory filters that have band widths referred to as critical bands 
|31 ^]. Critical bands are nearly uniform in widths similar to DFT for 
frequencies under 500 Hz, yet the widths increase rapidly towards higher 
frequencies. The nonuniform frequency resolution of the ear resembles that 
of wavelets [21 E] , though critical band widths do not follow a simple power 
law, and auditory filter shapes may not obey the requirements of the wavelet 
basis functions. An orthogonal discrete transform with broader and smoother 
spectrum towards higher frequencies than that of DFT is recently constructed 
[TT] to mimic the auditory filtering. Due to the limitation of orthogonality, 
the variation of the spectrum does not match the scale of critical bands. In 
addition, the spectrum of the transform does not carry enough perceptual 
meaning and so makes it inconvenient to perform psychoacoustically based 
spectral analysis and processing. 

In this paper, we present a novel many-to-one discrete auditory transform 
(MDAT) that maps sound signals from the time domain to a perceptually 
meaningful spectral domain on the scale of critical bands. The many-to- 
one mapping is consistent with the fact that physically and mathematically 
different signals can sound the same to human ears |31 |Hl . The frequency 
resolution for the perception of sound in our brain is much lower than that 
is required to fully describe a signal mathematically [H]. The perception 
variables of MDAT are band energies and band signal to noise ratios (SNRs), 
motivated by perceptual coding in AAC and MPS technology of digital music 
compression OIH]. The SNRs depend on two neighboring frames of a signal 
and so MDAT spectrum also encodes temporal information, different from 
DFT. As a test of the efficiency of these variables, and for the synthesis of 
sounds post spectral processing, we show how to construct an inverse which 
is perceptually equivalent to the input sound though mathematically not 
identical. Both the forward and inverse operations are in closed analytical 
form, and allow real time implementation of the resulting algorithms. 

Compared with DFT (implemented by FFT), MDAT has better temporal 
resolution due to its lower spectral resolution in higher frequencies. In terms 
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of the 256 point FFT used in this paper, the number of frequency bands of 
MDAT is in the 40's (see Tables 1 and 2 for signals with different sampling 
frequencies) while DFT has 128 frequency components. Compared to time 
domain filter bank with a relatively small number of band pass filters (4 to 16 
channels) as in body- worn hearing devices [3], MDAT has better frequency 
resolution yet does not have the delays encountered when a larger number of 
frequency separating band pass filters are needed. Hence MDAT is expected 
to be a useful tool in applications where a spectral processing strategy is 
necessary on the critical band scale, and a trade-off of spectral accuracy and 
temporal precision is to be optimized. 

The paper is organized as follows. In section 2, the MDAT is formu- 
lated and the associated perceptual variables are defined. Then an inverse is 
constructed in closed analytical form based on band energies and SNRs. In 
section 3, MDAT is applied to speech (sampled at 16 kHz) and music (sam- 
pled at 44.1 kHz) signals, and properties of perceptual spectral variables are 
illustrated. The reconstructed signals are compared with the input signals 
both spectrally and in waveforms, and these signals can be heard at author's 
website [TU]. Section 4 contains discussion and conclusion. 

2 MDAT and a Perceptual Inversion 

Let s = (so, • • • , sat-i) be a discrete real signal, the discrete Fourier transform 
(DFT) is P: 



The DFT is implemented by the fast Fourier transform (FFT) algorithm, we 
shall refer to the k = component of DFT as DC (direct current) and the 
other components as AC (alternating current) for short. 

Let us further map the s^'s to a spectral domain of lower resolution 
where perception variables can be better defined. Such a spectral domain is 
obtained from binning the DFT components into bands of various widths, 
similar to the critical band width distribution of human auditory filters. The 
detailed partition of DFT components, the band widths, and psychoacoustic 
bark values of the bands are listed in Table 1 and Table 2. Table 1 is at 
sampling frequency Fs = 16 kHz for speech sounds, and Table 2 is at Fs = 
44.1 kHz for music sounds. Let b = 1, 2, ■ ■ ■ , J, denote the number of bands. 




(2.1) 



n=0 
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and let B{b) denote the DFT wave numbers k in the 6-th band. In case of 
Table 1, N/J ^ 5.56; and in Table 2, N/J ^ 6.24. 
The signal energy in the b-th band is: 

keB{b) 

Let snr^ be the signal to noise ratio (SNR) in the b-th band, the perception 
domain consists of nonnegative 2 J-dimensional vectors whose components 
are band energies and band SNRs: 

Vperc = {(e(l), snr\ e(2), W, ■ ■ ■ , e( J), snr-^)}. (2.3) 

The snr^ are calculated following the AAC coding [5], an improvement of 
MP3 coding Let r{k,t) and f{k,t) be the amplitude and phase of s{k) 
at time frame t denoted by s{k,t). The predicted amplitude and phase at 
time frame t are: 

Tpredik, t) = r{k, t - 1) + Ar, Ar = r(fc, t - 1) - r{k, t-2), 
fpredik, t) = f{k, t-l) + A/, A/ = f{k, t-1)- f{k, t-2). (2.4) 

The unpredictability measure of the signal, a quantity for measuring the 
noisy (uncertain) part of signal, is: 

^(^^ ^ abs(g(fc,t) - Spred{k,t)) ^2 
abs(s(A;, t)) + abs{spred{k, t)) ' 

where Spred{k,t) = rprecz(^) 6*-^*"'='* It is clear that c{k,t) G [0,1]. Note 
that c{k,t) encodes the time domain information of the signal s, which is 
not available in DFT. As a result, the perceptual variables (|2.3p has both 
spectral and temporal information of the input signal. We shall omit the t 
dependence from now on, as all subsequent operations will not explicitly use 
t. 

The weighted unpredictability measure is: 

ec{b) = J2 r\k)c{k). (2.6) 
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Next, convolve e{b) and ec{b) with spreading functions jH] on the bark scale 
m as: 

J 

ecb{b) = ^ e(6') spread (bark (6'), bark (6)), (2.7) 
b'=i 
J 

ct{b) = ^ ec(6') spread (bark (&'), bark (6)), (2.8) 

b'=l 

where bark(6) is the bark value of the b-th partition (band). The bark 
scale 13] is nearly uniform on the logarithmic frequency scale. The spreading 
functions [HI carry the shape information of human auditory filters. 
Normalizing ct by energy ecb gives: 

cb{b) = ct{b)/ecb{b), (2.9) 

a noise to signal ratio, which in turn defines tonality index as: 

tb{b) = -0.299 - 0.43 log(c6(6)), (2.10) 

if the value is in (0, 1), otherwise equal to zero if the value is below zero, or 
one if the value is above 1. Finally, the signal to noise ratio in decibel (dB) 
is: 

snr'' = tb{b) TMN + (1 - tb{h)) NMT, (2.11) 

where TMN = 18 dB (tone masking noise), NMT = 6 dB (noise masking 
tone). The forward transform denoted by T from signal s to its image in the 
perception domain V^erc is a many-to-one mapping. Clearly, T(— s) = Ts. 

We notice that each snr'' is a monotone function of cb{b) which in turn de- 
pends on ec{b) and ct{b). So two other ways of characterizing the perception 
domain are: 

V^^rc = {(e(l), c6(l), e(2), c6(2), ■ ■ ■ , e( J), c6( J))}, (2.12) 

VSl = {(e(l), ec(l), e(2), ec(2), ■ ■ ■ , e( J), ec( J))}. (2.13) 

In other words, Vpl}c or Vperc is sufficient to describe the perception variables, 
i.e. the band energies and band SNRs. Below we show how to reconstruct a 
sound signal from VpDc or Vpe}c and obtain a perceptually equivalent inverse. 
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The inversion from a subset of 2 J dimensional space to the signal space 
{N > 2J) is non-unique. The inversion is through reconstructing the 
DFT vector Sk- Let us write the reconstructed DFT vector as: 



tti. = w 



'e'/\b)e''PK keB{b), (2.14) 



where the real weighting factors satisfy for all b: 

E l^^l' = l' (2-15) 

k£B{b) 

to preserve the band energy e{b). The real phase factors <y9^, and the DC 
component of DFT are assumed to be known for the reconstruction of the 
AC part of the DFT amplitude. 

The second conserved quantity (constraint) is ec(6) in ()2.6|) : 

ec{b)= J2 U\' e{b) c{k) = e{b) ^ \wl\'c{k). (2.16) 

keB{b) keB{b) 

Define: 

<w'>l= J2 (2.17) 

keB{b) 

which equals 

< w'' >l= ec{b)/e{b) G ( min c{k), max c{k)) C [0, 1]. (2.18) 

k&B(b) k&B(b) 

If the inversion is from Vpelc, then the two spectral constraints ()2.15p and 
(|2.17p are available to be imposed in each band containing at least two DFT 
components. If the inversion is from VpDp, then < w'^ >1 has to be recovered 
from e{b) and cb{b). By (j2.9j) . we have for each b G [1, J\. 

^^^^^ ^ ELi em < w'' >l spread(bark(y),bark(&)) 19) 
Ylib'=i ^(^') spread(bark(6'), bark(6)) 



or 



e{b') <w^' >l spread(bark(6'),bark(6)) 

6'=1 

J 

cb{b) e{b') spread(bark(6'), bark(6)). (2.20) 



Equation (j2.2(J|) can be recast as a matrix equation Sx = z, where S = 
(spread(bark(&'), bark(6)) is a square matrix, x is the column vector with 
entries e{b) < z the right hand side column vector. The commonly 

used spreading matrix S (based on e.g. Schroeder's spreading functions jH]) 
does not have a nonnegative inverse. In order to find nonnegative solutions 
in general, one may solve a quadratic programming problem from (j2.19p . 
Define the matrix Q = {qij) with its entries: 

e(j) spread(bark(j), bark(z)) 



X]j=i ^0) spread(bark(j), bark(z)) 

The matrix Q is invertible. A column vector y = {< >1) is sought to 
minimize the norm ||c6 — Qy\\2 subject to the constraint yl{h) < y{b) < 
yu{b), yl{h) = miUkeBib) c{k), yu{h) = maXkeB{b) c{k). 

In signal processing tasks that keep the band SNRs invariant as in hearing 
aids gain prescriptions, the quadratic programming is not needed, directly 
inverting S will suffice to find (< >1). 

Next we solve for w\, from the two equations ()2.15|) and ()2.17|) . using 
information of c(A;), A; G B{h). Let A*";, be the number of DFT components 
in 5(6), p= (|<|M<P,---,KJT, ^= (c(fcO,c(M,---,c(fcA.Jf, 
kj e 5(6), 9b =< >l, e = (1, 1, ■ ■ ■ , 1)'^ e -R^^ T denoting transpose. 
Equations (j2.15|) and (j2.17j) now read (dot refers to inner product): 

e - p = 1, (2.21) 
i^- p = 9b. (2.22) 

If ip is parallel to e, equation (j2.22j) is redundant with 9b = c(l) by 
definition and equation (j2.21|) . This is true in particular if A*";, = 1. The 
simplest smooth solution to ()2.21|) is p = -^e. 

If Afc > 2 and is not parallel to e, define vector: 

v = e-^^iJ^O, (2.23) 
e ■ ip 

clearly v ■ e = 0, and e ■■?/'> 0. Equations 1)2.211) and ()2.22|) imply that: 

iJ. ^=1- iL^Ob. (2.24) 
e ■ ip 
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Equation (|2.2H) and equation (j2.24p say that in the orthonormal basis 
with e and v as two directions, the coordinates along e and v are constrained, 
the other coordinates are free. The simplest two dimensional solution is ob- 
tained by setting the free coordinates to zero (|| ■ II2, norm or the Euclidean 
distance): 

P=^^ + - ^Ot) (2.25) 

which becomes upon substituting in ()2.2Hj) : 



P 



\\v\\i V e-ijj 



e-^^ l-^^O^- (2-26) 
e ■ ip \\v\\2 V e ■ ip 



The regularity of solution ()2.25|1 or ()2.2(i|l is no worse than that of which 
is oscillatory in general. With the ty^'s so determined, a time domain signal 
is reconstructed by inverse DFT using the reconstructed a\., k G B{b), b = 
1,2,--- , J. 

If A'';, = 2, ()2.26|1 is the unique solution. If A^";, > 3 (true if b is above 
some critical number, see Table 1 and Table 2), there are infinitely many 
solutions to ()2.21|1 - ()2.22|1 . It is desirable to seek a smoother solution because 
spectral smoothness improves temporal localization of the inverse transform. 
One way to obtain a smoother solution over the frequency bands {Nh > 3) 
starting with FFT wave number ko is to minimize the following quadratic 
function: 

/=2 E (P'^+i-P^)'' (2-27) 

where M + 1 is the total number of DFT components in those bands B{b) 
with Nf, > 3, subject to the two constraints ()2.2H) - ()2.22|) in each such band 
B{b). Let u = {pko, ■ ■ ■ , pko+KiY, and define: 

g,{u) = -1+ 5Z Pfc, (2.28) 

k<^B{b) 

hiu) = -Ob+ ''^P'^^ (2.29) 

fcGS(b) 

then the constraints are of the form Qb = ^ and = 0. The minimizer can 
be approached as a steady state in a constrained gradient descent method 
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(7j. Let u = u{t) solve the equation: 



Mt = -Vs/- ^ Xb^nQb- ^ Vb'^uh, (2.30) 

where the Lagrange multiphers Xb and rji, are chosen so that the constraints 
in each band are satisfied: 

-^9b[u) = Vitgb-ut 

= -^uQb-'^uf - W^ugb? -rib^ugb-^uhb = ^, (2.31) 



dr 

= -Vuhb-Vitf-XbVuhb-Vitgb-'nb\Vithb\^ = ^- (2.32) 

We have used the fact that V^/ib or Vugb only have nonzero components 
in the band B{h). To solve ()2.3ip - ()2.32p band by band, it is convenient to 
consider 

4 = 1- ^ , k e B{b). (2.33) 

If Cfc = 0, for all k G B{b), then the second constraint /if, = is redundant, 
rjb = 0, and 

A, = (2-34) 
If Cfc 7^ 0, for some k G -B(&), replace the constraint /if, = by: 

hb{u)= CkPk-l + ^ ^ = 0. (2.35) 



keB{b) 



Then the u equation is ()2.30|) with hf, in place of hf,. Due to gb ■ V^/ hf, = 0, 
Xb is as given in ()2.34|) . and: 

Vt//ife ■ V^/ 

^6 = 1^ r I, , (2.36) 

where |Vc^bp = Efceij(i,) and \V^gb\^ = Nb in (j2.34;i. 
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Finally, let us put the u equation in matrix form. Let A be the symmetric 
tridiagonal matrix with I's on the off-diagonals, and (—1, —2, ■ ■ ■ , —2, — 1) 
on the diagonal (■ ■ ■ refer to — 2's), then V^/ = Au. Let R be the block 
diagonal matrix where each block is the symmetric Ni, x A*";, matrix with the 
(i, j')-th entry being A*"."^ + ^ '^''^^ -2 . If uaR(h\ c? is zero, the second term 

in the sum is understood to be absent. The matrix form of u equation is 
(J the identity matrix): Ur = {I — R)Au, whose solution is in closed form 
u{t) = exp{(J — R)At}uo. The initial data uq is given by the values of pk^, 
• • ■ , Pko+M in the explicit formula ()2.26|) . 



3 Numerical Experiments 

The forward and inverse transforms are implemented with the 256 point FFT. 
For speech signals. Table 1 is used at sampling frequency 16 kHz. For music 
signals. Table 2 is used at sampling frequency 44.1 kHz. Top (bottom) panel 
of Figure 1 shows the oscillatory unpredicatibility measure c{k) of a speech 
(music) frame. Top (bottom) panel of Figure 2 is the corresponding weighted 
unpredicatibility measure ec{b) for the speech (music) frame, oscillation is 
slower over the coarser scale b. In Figure 3 (Figure 4), we compare the 
original and reconstructed FFT amplitude spectra {k e [20, 128]) of a speech 
(music) frame. The difference is negligible for k G [0,20]. We see that the 
reconstructed FFT spectra captured well the upper envelope of the original 
FFT spectra of the speech frame. For the music frame, much more details of 
the FFT spectra are recovered. Except for a mismatched peak and a valley 
over k G [20,40], the dashed and solid curves nearly agree. If one zooms 
in further, one may see differences over smaller scales yet the reconstructed 
(dashed) curve again keeps track of the envelope of the original spectral shape 
well. Figure 5 compares the smoother spectral solution (r = 2, dashed) with 
the simple solution (r = 0, solid) in case of a speech frame over k G [30, 128] 
where constrained optimization (smoothing) takes place. The steady state is 
almost approached at r = 2. The smoothing is similar for music frames. 

Figure 6 (Figure 7) compares the original and reconstructed speech (mu- 
sic) waveforms. The total relative error for the speech signal in Figure 6 is 
12 %, and is only 1.5% for music signal of Figure 7. This is consistent with 
the better spectral fit of Figure 4 than that of Figure 3. The improvement 
by the optimization (|2.27|) - (j2.29j) is however found to be minor both in terms 
of the relative P error of reconstructed signals and perceptual difference in 
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hearing the signals. The optimization step may be helpful however in other 
signal processing tasks to be evaluated in the future. 

The original and reconstructed (r = 0) speech (music) signals in Fig- 
ure 6 and Figure 7 can be heard at http:/ / math. uci.edu/ ^jxin/ sounds. html. 
Inspite of the errors (loss) incurred in the reconstruction, there is very lit- 
tle perceptual difference between the original and the reconstructed signals, 
thanks to the masking effects present in the human ears [Hj. Hence we have 
achieved the perceptually equivalent inversion of the many-to-one transform. 

4 Discussion and Conclusion 

A many-to-one auditory transform is introduced so that the resulting spec- 
trum, especially towards the higher frequency regime, is much less refined 
than the FFT spectrum, yet just enough to resolve the band widths of human 
auditory filters (critical bands). A reconstruction of perceptually equivalent 
inverse is given so that the inverted signal makes little perceptual difference 
from the input signal even though there is a loss mathematically. The inver- 
sion preserves the band energies and band signal to noise ratios, which prove 
to be essential in capturing the perception of sounds. Both the forward and 
inverse transforms are in closed analytical form and can be carried out in real 
time. Test examples on speech and music signals illustrated the properties 
of the transform and its inversion. The transform is a promising new tool 
for sound compensation or enhancement that requires spectral manipulations 
over the scale of critical bands. 

A future study may concern with more accurate inversion while conserv- 
ing additional spectral information of the signal, such as energy variation 
about its mean value e{b)/Nh inside each frequency band with A^^ > 3. An- 
other is to further develop MDAT in specific applications such as hearing 
aids and hearing implants. 
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Figure 1: Top panel: unpredicatibility measure c{k) of a speech frame, illus- 
trating its oscillatory nature in FFT wave number k, k ^ [0, 128]. Bottom 
panel: unpredicatibility measure c{k) of a music frame, k G [0, 128]. 
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Figure 2: Top panel: weighted unpredicatibility measure ec{b) of a speech 
frame, b G [0,46]. Bottom panel: weighted unpredicatibility measure ec{b) 
of a music frame, b G [0,41]. 
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Figure 4: Original (solid) and reconstructed (dashed, r = 0) FFT amplitude 
spectra of a music frame. 
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Figure 5: Comparison of reconstructed (solid, r = 0) and (dashed, r = 2) 
FFT amplitude spectra of a speech frame. The dashed curve is smoother 
while satisfying the same spectral constraints. 
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Figure 6: Comparison of the input (top) and reconstructed (bottom, 
speech signals in waveforms. 
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Figure 7: Comparison of the input (top) and reconstructed (bottom, 
music signals in waveforms. 
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Table 1: Partition and psychoacoustic parameters for the 256 point FFT at 
16 kHz sampling frequency. The columns are (from left to right) band index, 
low FFT index of the band, high FFT index of the band, number of FFT 
components in the band (width), the bark value of the band. The symmetric 
part of the AC components of FFT are not listed. Zero index refers to DC 
component of FFT. 
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Table 1: Continued. 
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Table 2: Partition and psychoacoustic parameters for the 256 point FFT 
at 44.1 kHz sampling frequency. The columns are (from left to right) band 
index, low FFT index of the band, high FFT index of the band, number 
of FFT components in the band (width), the bark value of the band. The 
symmetric part of the AC components of FFT are not listed. Zero index 
refers to DC component of FFT. 
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Table 2: Continued. 



Band Index 


Low FFT Index 


High FFT Index 


Width 


Bark Value 


28 


41 


43 


3 


20.62 


29 


44 


47 


4 


21.01 


30 


48 


51 
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21.43 


31 


52 


55 


4 


21.81 
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